Method, Device, and Storage Medium for Evaluating Speech Quality

ABSTRACT

A method, a device, and a storage medium for evaluating speech quality are provided. The method includes: receiving speech data to be evaluated; extracting evaluation features of the speech data to be evaluated; and performing quality evaluation on the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, in which the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation of International Application No. PCT/CN2016/111050, filed on Dec. 20, 2016, which claims priority to Chinese Patent Application No. 201610892176.1, filed on Oct. 12, 2016, the contents of both of which are incorporated herein by reference in their entireties.

TECHNICAL FIELD

The present disclosure relates to the field of communication technology, and more particularly to a method for evaluating speech quality and a device for evaluating speech quality.

BACKGROUND

With the continuous development of technologies, communication plays an increasingly important role in people's lives, for example in speech data transmission over a communication network. Speech quality is an important factor for evaluating the quality of the communication network. In order to evaluate the speech quality, it is necessary to develop an effective speech quality evaluation algorithm.

The speech quality evaluation in the communication network can be performed using, for example, the Perceptual Evaluation of Speech Quality (PESQ) algorithm and the Perceptual Objective Listening Quality Analysis (POLQA) algorithm. In order to implement these algorithms, both input speech data and output speech data often need to be obtained. The input speech data is generally clean speech data and the output speech data is generally degraded speech data after passing the communication network. The output speech data may be evaluated by analyzing the input speech data and the output speech data. The input speech data is generally collected by road detection vehicles of operators. However, the input speech data cannot be obtained on the floors of residential areas, in malls or in other indoor conditions because the road detection vehicles cannot collect data in these conditions. Thus, speech quality evaluation cannot be performed according to the input speech data, and the mentioned algorithms, which rely on both the input speech data and the output speech data to evaluate the speech quality of the output speech data, have limitations in application.

SUMMARY

Implementations of a first aspect of the present disclosure provide a method for evaluating speech quality. The method includes: receiving speech data to be evaluated; extracting evaluation features of the speech data to be evaluated; and performing quality evaluation on the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, in which the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

Implementations of the present disclosure also provide a non-transitory computer-readable storage medium including instructions for evaluating speech quality, which instructions when executed by a processor become operational with the processor to: receive speech data to be evaluated; extract evaluation features of the speech data to be evaluated; and perform quality evaluation on the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, in which the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

Implementations of the present disclosure also provide a device. The device includes one or more processors; and a memory storing one or more programs which when executed by the one or more processors become operational with the one or more processors to: receive speech data to be evaluated; extract evaluation features of the speech data to be evaluated; and perform quality evaluation on the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, in which the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

Additional aspects and advantages of implementations of the present disclosure will be given in part in the following descriptions, become apparent in part from the following descriptions, or be learned from the practice of the implementations of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other aspects and advantages of implementations of the present disclosure will become apparent and more readily appreciated from the following descriptions made with reference to the drawings, in which:

FIG. 1 is a flow chart of a method for evaluating speech quality according to an implementation of the present disclosure;

FIG. 2 is a flow chart of a method for evaluating speech quality according to another implementation of the present disclosure;

FIG. 3 is a schematic diagram of a device for evaluating speech quality according to an implementation of the present disclosure;

FIG. 4 is a schematic diagram of a device for evaluating speech quality according to another implementation of the present disclosure.

DETAILED DESCRIPTION

Reference will be made in detail to implementations of the present disclosure. Implementations of the present disclosure will be shown in drawings, in which the same or similar elements and the elements having same or similar functions are denoted by like reference numerals throughout the descriptions. The implementations described herein according to drawings are explanatory and illustrative, and shall not be construed to limit the present disclosure. On the contrary, implementations of the present disclosure include all the changes, alternatives, and modifications falling into the scope of the spirit and principles of the attached claims.

In order to solve the problems in the PESQ algorithm and to adapt to the demands of speech quality evaluation in the 4G/LTE era, ITU-T (International Telecommunication Union Telecommunication Standardization Sector) began development work on the POLQA algorithm in 2006, which was officially issued as the ITU-T P.863 standard in early 2011. A main feature of the standard is that it covers the latest speech coding and network transmission technologies. It has higher accuracy and supports high-quality ultra-wideband (50 Hz-14 kHz) speech transmission in 3G, 4G/LTE and VoIP (Voice over Internet Protocol) networks. Therefore, the POLQA algorithm is at present a commonly selected algorithm for evaluating speech quality of a communication network.

Deep Learning originates from Artificial Neural Networks. A multi-layer perceptron with a plurality of hidden layers is one structure of Deep Learning. Deep Learning forms more abstract high-layer representations of attribute types or features by combining low-layer features, so as to find distributed feature representations of data. For example, application fields of Deep Learning can include computer vision, acoustic model training for speech recognition, machine translation, semantic mining and other natural language processing fields.

As Deep Learning is still developing continuously, it is an objective of the present disclosure to use it in the communication field, especially in speech quality evaluation of the communication field.

For example, the POLQA algorithm can be used for completing speech quality evaluation. However, the inventor finds that the POLQA algorithm needs two-ended speech data, that is, when evaluating speech quality of the output speech data, it needs not only the output speech data but also the input speech data. Because it is difficult to obtain input speech data in some cases, the application of the POLQA algorithm is limited. In order to avoid this application limitation, it is necessary to provide new solutions. By further analysis, the inventor finds that models predetermined by Deep Learning have excellent performance, so Deep Learning may be introduced into speech quality evaluation algorithms. Furthermore, to avoid the application limitation caused by adopting two-ended speech data, only single-ended speech data may be used as samples for training when models are predetermined by Deep Learning. Thus, only the speech data to be evaluated is needed, as single-ended data, when evaluating speech quality with a predetermined model.

Therefore, an objective of the present disclosure is to introduce Deep Learning into the speech quality evaluation, especially into the speech quality evaluation in the communication field. By providing a new solution relying only on single-ended speech data for the speech quality evaluation in the communication field, and by determining models via Deep Learning when relying on the single-ended speech data only, excellent performance of the models may be ensured. Thus, the technical problems of the speech quality evaluation are solved with less limitation but better performance. Furthermore, it should be noted that specific technical solutions are not limited to the implementations disclosed herein. The features described below may be combined with other features, and these combinations still belong to the scope of the present disclosure.

Furthermore, it should be noted that, although some technical problems solved by the present disclosure have been given above, the present disclosure is not limited to solving the technical problems above. Other problems solved by the present disclosure still belong to the scope of the present disclosure.

Furthermore, it should be noted that, although the main ideas of the present disclosure have been given above, and some special points will be illustrated in subsequent implementations, innovative points of the present disclosure are not limited to the content involved in the main ideas above and the special points. It is not excluded that some contents of the present disclosure not specially illustrated may still include innovative points of the present disclosure.

It should be understood that, although some technical solutions are described, other possible technical solutions are not excluded. Therefore, the technical solutions given by same, similar, equal or other cases of implementations of the present disclosure still belong to the scope of the present disclosure.

The technical solutions of the present disclosure will be illustrated with reference to specific implementations as follows.

FIG. 1 is a flow chart of a method for evaluating speech quality according to an implementation of the present disclosure.

As shown in FIG. 1, the method in the implementation includes the following.

At S11, speech data to be evaluated is received.

Taking the communication field as an example, the speech data to be evaluated may specifically refer to output speech data of the communication network, i.e. degraded speech data after input speech data passes the communication network. The input speech data generally refers to clean speech data or original speech data, and the degraded speech data generally refers to speech data with quality degradation (such as one or more of degradation of clarity, delay, noise, etc.) with respect to the original speech data.

At S12, evaluation features of the speech data to be evaluated are extracted.

The evaluation features of the speech data to be evaluated are the same as the evaluation features extracted from degraded speech data when a speech quality evaluation model is predetermined, which may be determined according to application demands.

In general, the evaluation features refer to features describing the speech data from a perspective of human auditory perception, details of which are given in the subsequent description.

At S13, quality evaluation is performed on the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, in which the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

The speech quality evaluation model may be predetermined before the speech quality evaluation is required. For example, the speech quality evaluation model may be predetermined offline, so that the speech quality evaluation model may be used directly when the speech quality evaluation is required. Certainly, it is not excluded that the speech quality evaluation model is predetermined online, such as predetermined online when the speech quality evaluation is required. The specific determining process of the speech quality evaluation model is given in the subsequent description.

Input and output of the speech quality evaluation model are evaluation features and quality information of the single-ended speech data respectively. Therefore, after extracting the evaluation features of the speech data to be evaluated, the evaluation features of the speech data to be evaluated may be taken as the input of the speech quality evaluation model, thus the output obtained from the speech quality evaluation model is the quality information of the speech data to be evaluated, and then the speech quality evaluation is realized.

Furthermore, the speech quality evaluation model may be described by a regression model or by a classification model. The quality information mentioned above differs depending on which type of model describes the speech quality evaluation model. For instance, if the speech quality evaluation model is described by the regression model, the quality information obtained is a specific evaluation score, such as a score among 1-5. If the speech quality evaluation model is described by the classification model, the quality information obtained is an evaluation classification, such as a classification among worse, bad, common, good and better.

Furthermore, in some implementations, in order to improve accuracy of the speech quality evaluation, normalization may be performed on a result of the speech quality evaluation obtained in S13. Taking the result of the quality evaluation being the evaluation score as an example, the evaluation score obtained in S13 may be taken as a final evaluation score directly in the normalization; alternatively, the evaluation score obtained in S13 may be normalized according to packet loss, jitter, delay and other related parameters of the communication network to obtain the final evaluation score. The algorithm for normalizing according to the parameters of the communication network may be set as required, which is not described in detail herein. For instance, the evaluation score obtained in S13 may be multiplied by a coefficient to obtain the final evaluation score, in which the coefficient is related to the above parameters of the communication network.
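
As a minimal sketch of the coefficient-based normalization mentioned above, the following Python example multiplies the evaluation score by a coefficient derived from packet loss, jitter and delay. The function name, the linear form of the coefficient and the penalty weights are assumptions made purely for illustration; the present disclosure does not specify them.

```python
def normalize_score(raw_score, packet_loss, jitter_ms, delay_ms):
    """Scale a raw evaluation score by a network-dependent coefficient.

    packet_loss is a fraction in [0, 1]; jitter_ms and delay_ms are in
    milliseconds. All weights below are hypothetical placeholders.
    """
    # Hypothetical penalty: the worse the network parameters, the larger it is.
    penalty = 0.5 * packet_loss + 0.001 * jitter_ms + 0.0005 * delay_ms
    # The coefficient is related to the network parameters and never negative.
    coefficient = max(0.0, 1.0 - penalty)
    # Keep the final score inside the 1-5 range used for evaluation scores.
    return min(5.0, max(1.0, raw_score * coefficient))
```

With all network parameters at zero the coefficient is 1 and the score obtained in S13 is returned unchanged, which corresponds to taking the evaluation score directly as the final evaluation score.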

In the implementation, by performing the quality evaluation on the speech data to be evaluated via the speech quality evaluation model, only the single-ended speech data is needed for speech quality evaluation, which avoids the limitation in applications caused by relying on two-ended speech data, thus expanding the scope of applications.

FIG. 2 is a flow chart of a method for evaluating speech quality according to another implementation of the present disclosure.

In this implementation, the degraded speech data after passing the communication network is taken as an example of the speech data to be evaluated. Determining by Deep Learning is taken as an example for determining the speech quality evaluation model.

As shown in FIG. 2, the method in the implementation includes the following.

At S21, speech data is obtained, in which the speech data includes clean speech data and degraded speech data.

The speech data may be obtained by at least one of collecting and obtaining directly from existing data. In order to improve accuracy of the determined speech quality evaluation model, as much speech data as possible should be obtained here.

Taking collecting as an example, the clean speech data and the degraded speech data after passing the communication network may be obtained respectively by simulating communication. Specifically, a large amount of clean speech data, such as 2000 hours of clean speech data, is first collected from a high-fidelity studio. Then multiple mobile phones are used to simulate calls, that is, one mobile phone is used to make a call and play the clean speech data, and another mobile phone is used to receive the clean speech data. The degraded speech data after passing the communication network is obtained by restoring transmitted data packets at different interfaces of the communication network.

Certainly, the clean speech data and the degraded speech data may also be obtained respectively by collecting speech data from real network calls directly, and the method of collecting is not limited in the present disclosure.

Furthermore, the clean speech data and the degraded speech data may be collected separately when collecting the speech data, thus the clean speech data and the degraded speech data may directly be obtained respectively. Alternatively, the clean speech data and the degraded speech data may be collected together while marked respectively to distinguish the clean speech data from the degraded speech data when collecting the speech data. For instance, a marker 1 may be used to represent the clean speech data, and a marker 0 may be used to represent the degraded speech data, so that the clean speech data and the degraded speech data may be obtained according to the markers.
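
The marker-based separation described above can be sketched as follows. The record layout, a (marker, audio) pair, and the function name are assumptions for illustration only; the marker values 1 and 0 follow the example in the text.

```python
CLEAN_MARKER = 1      # marker 1 represents clean speech data
DEGRADED_MARKER = 0   # marker 0 represents degraded speech data


def split_by_marker(records):
    """Split jointly collected (marker, audio) records into two lists."""
    clean = [audio for marker, audio in records if marker == CLEAN_MARKER]
    degraded = [audio for marker, audio in records if marker == DEGRADED_MARKER]
    return clean, degraded
```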

At S22, clean speech data to be processed is obtained according to the clean speech data, and degraded speech data to be processed is obtained according to the degraded speech data.

S22 may include at least one of:

using the degraded speech data as the degraded speech data to be processed;

extracting valid speech segments from the degraded speech data, and using the valid speech segments of the degraded speech data as the degraded speech data to be processed;

clustering the degraded speech data, and using degraded speech data corresponding to first cluster centers as the degraded speech data to be processed; and

extracting valid speech segments from the degraded speech data, clustering the valid speech segments of the degraded speech data, and using valid speech segments corresponding to second cluster centers as the degraded speech data to be processed.

Specifically, after obtaining the clean speech data and the degraded speech data, the clean speech data obtained and the degraded speech data obtained may be directly taken as the clean speech data to be processed and the degraded speech data to be processed respectively. Alternatively, after obtaining the clean speech data and the degraded speech data, valid speech segments may be extracted from the clean speech data and the degraded speech data respectively. The valid speech segments of the clean speech data are taken as the clean speech data to be processed, and the valid speech segments of the degraded speech data are taken as the degraded speech data to be processed. Specific ways for extracting the valid speech segments are not limited; for instance, Voice Activity Detection (VAD) may be used. By processing the valid speech segments only, computation and complexity may be reduced.
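
As one possible illustration of extracting valid speech segments, the following sketch uses a simple energy-threshold detector, which is only one of many VAD techniques and is not stated in the present disclosure to be the technique used. The frame length, the threshold and the function names are hypothetical.

```python
def frame_energies(samples, frame_len):
    """Average energy of each non-overlapping frame of frame_len samples."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def valid_segments(samples, frame_len=4, threshold=0.01):
    """Return (start, end) sample indices of runs of frames above threshold."""
    segments = []
    start = None
    energies = frame_energies(samples, frame_len)
    for idx, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = idx * frame_len                    # a speech run begins
        elif energy <= threshold and start is not None:
            segments.append((start, idx * frame_len))  # the run ends here
            start = None
    if start is not None:                              # run reaches the end
        segments.append((start, len(energies) * frame_len))
    return segments
```

Only the samples inside the returned index ranges would then be passed to the later steps, which is how the reduction in computation mentioned above is obtained.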

Furthermore, when obtaining the degraded speech data to be processed, all the degraded speech data included in the speech data or all the valid speech segments of the degraded speech data may be taken as the degraded speech data to be processed. Alternatively, part of the degraded speech data or its valid speech segments may be selected as the degraded speech data to be processed. When selecting, a clustering method may be used, in which all the degraded speech data or its valid speech segments may be clustered to obtain cluster centers, and the degraded speech data or the valid speech segments corresponding to the cluster centers are taken as the degraded speech data to be processed.

For instance, when clustering, ivector features of the valid speech segments of the degraded speech data are extracted. The ivector features extracted are clustered by the k-means method to obtain k cluster centers. The degraded speech data or the valid speech segments corresponding to each of the k cluster centers are taken as the degraded speech data to be processed. By clustering and by selecting the degraded speech data corresponding to the cluster centers to process, computation may be reduced and computation efficiency may be improved.
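
The selection by cluster centers can be sketched as below. This is a simplified k-means written for illustration: the feature vectors stand in for ivector features (ivector extraction itself is not shown), the centers are initialized from the first k vectors, and "corresponding to a cluster center" is interpreted here as the sample nearest to that center. All of these choices, and the function names, are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; returns the k cluster centers."""
    centers = [list(v) for v in vectors[:k]]  # simplistic initialization
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: squared_distance(v, centers[c]))
            groups[nearest].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def select_representatives(vectors, k):
    """Indices of the samples closest to each of the k cluster centers."""
    centers = kmeans(vectors, k)
    chosen = {
        min(range(len(vectors)), key=lambda i: squared_distance(vectors[i], center))
        for center in centers
    }
    return sorted(chosen)
```

Only the segments at the returned indices would be kept as the degraded speech data to be processed, so at most k segments are scored and used for training.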

At S23, an evaluation score of the degraded speech data to be processed is computed based on the clean speech data to be processed and the degraded speech data to be processed.

Taking the valid speech segments as the data to be processed, for instance, after obtaining the valid speech segments of the clean speech data and the valid speech segments of the degraded speech data, each valid speech segment of the degraded speech data is analyzed frame-by-frame according to the valid speech segments of the clean speech data to compute the evaluation score of the valid speech segments of the degraded speech data. The method for computing the evaluation score is not limited. For instance, the evaluation score may be a mean opinion score (MOS) of the speech data, the specific computing method of which may be the same as that in the related art. For instance, the evaluation score may be obtained by the POLQA algorithm or the PESQ algorithm, which are not described in detail herein.

At S24, evaluation features of the degraded speech data to be processed are extracted.

The evaluation features describe the speech data from the perspective of human auditory perception. Specifically, time domain features of the degraded speech data to be processed are extracted first, such as short-time average energy of the speech data, segmented background noise of the speech data, short-time wave shock or shake of the speech data, fundamental frequency features and difference features of the fundamental frequency (such as a first or second order difference value of the fundamental frequency), etc. Then spectral features of the degraded speech data to be processed are extracted, such as FilterBank features and linear predictive coding (LPC) features, etc. A filter with a cochlear shape, which can describe the speech data from the perspective of human auditory perception, is used to extract the spectral features, thus the spectral features extracted can describe the speech data from the perspective of human auditory perception. In order to describe the degraded speech data to be processed better, the mean value, variance, maximum value, minimum value and difference features (such as a first or second order difference), etc., of each spectral feature may be extracted. Which evaluation features are to be extracted may be determined according to application demands and the degradation condition of the speech data, which is not limited in the present disclosure.
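
A minimal sketch of a few of the features named above is given below: short-time average energy, a first order difference, and the mean, variance, maximum and minimum statistics. The frame length and function names are assumptions, the statistics are applied here to the energy track only for brevity, and the spectral features (FilterBank, LPC) and the cochlear-shaped filter are not shown.

```python
def short_time_energy(samples, frame_len):
    """Average energy of each non-overlapping frame of frame_len samples."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def first_difference(values):
    """First order difference between consecutive feature values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def summarize(values):
    """Mean, variance, maximum and minimum of one feature track."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "variance": variance,
            "max": max(values), "min": min(values)}
```

A second order difference can be obtained by applying first_difference to its own output, and the resulting statistics can be concatenated into one evaluation feature vector per segment.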

At S25, the speech quality evaluation model is determined by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed.

The Deep Learning technique is used for training to obtain the parameters of the speech quality evaluation model, thus the speech quality evaluation model is determined.

The network topology used by the Deep Learning method may be a combination of one or more of Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) Neural Networks or other Neural Networks, which is not specifically limited in the present disclosure. The selection of the networks may be determined by application demands. The training process after obtaining the input and output of the model is the same as that in the related art, which is not described in detail herein.

Furthermore, the speech quality evaluation model may be described by different types of models, such as the regression model or the classification model. The input and output corresponding to the speech quality evaluation model may be adjusted accordingly when the speech quality evaluation model is described by different types of models.

In detail, based on a determination that the speech quality evaluation model includes a regression model, the obtained evaluation features of the degraded speech data to be processed are used as inputs of the speech quality evaluation model and the evaluation score of the degraded speech data to be processed is used as an output of the speech quality evaluation model.

Based on a determination that the speech quality evaluation model includes a classification model, the obtained evaluation features of the degraded speech data to be processed are used as inputs of the speech quality evaluation model, and the evaluation score of the degraded speech data to be processed is quantized to obtain evaluation classifications, and the evaluation classifications are used as outputs of the speech quality evaluation model.

Specifically, when quantizing the evaluation score, a fixed step or an unfixed step may be used to quantize the evaluation score of the degraded speech data to be processed. For instance, if the fixed step is used, the evaluation scores of all the degraded speech data to be processed are quantized with a fixed step of 0.2 to obtain the classification of each degraded speech data to be processed. Taking a MOS score as an example, when quantizing with a step of 0.2, 20 evaluation classifications may be obtained by quantizing scores of 1-5. If the unfixed step is used, the quantization step for each range of the evaluation score of the degraded speech data to be processed may be determined according to application demands. For instance, a large step may be used to quantize a range with lower evaluation scores, and a small step may be used to quantize a range with higher evaluation scores. Taking the MOS score as an example, a large step, 0.5 for instance, may be used for the score range from 1 to 3, which is a range with lower scores, and a small step, 0.2 for instance, may be used for the score range from 3 to 5, which is a range with higher scores, so that a total of 14 evaluation classifications may be obtained after quantizing.
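
The class counts in the two examples above can be checked with a short sketch. A fixed step of 0.2 over scores 1-5 gives (5 - 1) / 0.2 = 20 classifications, while a step of 0.5 over 1-3 and a step of 0.2 over 3-5 gives 4 + 10 = 14 classifications. The band representation, the function names and the handling of scores lying exactly on a band boundary (assigned to the higher band, with the maximum score clamped into the last class) are assumptions for illustration.

```python
# Each band is (low, high, step); values follow the MOS examples in the text.
FIXED_BANDS = [(1.0, 5.0, 0.2)]
UNFIXED_BANDS = [(1.0, 3.0, 0.5), (3.0, 5.0, 0.2)]


def band_size(low, high, step):
    """Number of evaluation classifications inside one band."""
    return round((high - low) / step)


def num_classes(bands):
    """Total number of evaluation classifications over all bands."""
    return sum(band_size(*band) for band in bands)


def quantize(score, bands):
    """Map an evaluation score to a zero-based classification index."""
    offset = 0
    for position, (low, high, step) in enumerate(bands):
        size = band_size(low, high, step)
        is_last = position == len(bands) - 1
        if score < high or is_last:
            # Small epsilon guards against floating-point rounding at edges.
            index = int((score - low) / step + 1e-9)
            return offset + min(max(index, 0), size - 1)
        offset += size
```

For a classification model, quantize(score, bands) would be applied to the evaluation score of each degraded speech data to be processed, and the resulting index would serve as the training target in place of the raw score.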

Certainly, other methods may also be used to quantize the evaluation score to divide the evaluation score into a plurality of evaluation classifications, such as the classification among worse, bad, common, good and better, which is not limited in the present disclosure.

At S26, degraded speech data after passing the communication network is received.

At S27, evaluation features of the degraded speech data after passing the communication network are extracted.

The method for extracting the evaluation features of the degraded speech data after passing the communication network is the same as that in the training process, which is not described in detail herein.

At S28, quality evaluation is performed on the degraded speech data after passing the communication network according to the evaluation features and the speech quality evaluation model.

Specifically, the evaluation features of the current degraded speech data are taken as the input of the speech quality evaluation model, and the output of the speech quality evaluation model is taken as a quality evaluation result of the current degraded speech data. If the speech quality evaluation model is described by the regression model, the quality evaluation result will be the evaluation score. If the speech quality evaluation model is described by the classification model, the quality evaluation result will be the evaluation classification.
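
The evaluation step can be sketched as follows. The two model functions are toy stand-ins for a trained speech quality evaluation model and are not trained networks; the assumption that a classification model outputs one score per classification, and the names used, are for illustration only. The label list follows the example classifications in the text.

```python
CLASS_LABELS = ["worse", "bad", "common", "good", "better"]


def toy_regression_model(features):
    """Stand-in for a trained regression model: returns a score in 1-5."""
    level = min(1.0, max(0.0, sum(features) / len(features)))
    return 1.0 + 4.0 * level


def toy_classification_model(features):
    """Stand-in for a trained classifier: returns one score per class."""
    return [0.1, 0.2, 0.6, 0.05, 0.05]


def evaluate(features, model, model_type):
    """Feed evaluation features to the model and return the quality result."""
    output = model(features)
    if model_type == "regression":
        return float(output)                      # evaluation score
    if model_type == "classification":
        best = max(range(len(output)), key=lambda i: output[i])
        return CLASS_LABELS[best]                 # evaluation classification
    raise ValueError("unknown model type: " + model_type)
```

Only the features of the degraded speech data are passed in, which reflects that no clean reference signal is needed at evaluation time.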

In the implementation, by performing the quality evaluation on the speech data to be evaluated via the speech quality evaluation model, only the single-ended speech data is needed for speech quality evaluation, which avoids the limitation in applications caused by relying on two-ended speech data, thus expanding the scope of applications. Furthermore, by training with the Deep Learning method, the excellent performance of the Deep Learning method may be used to make the speech quality evaluation model more accurate, so that the speech quality evaluation results are more accurate. Furthermore, by performing the quality evaluation on speech data in the communication field, Deep Learning may be combined with the speech quality evaluation in the communication field, and new solutions for speech quality evaluation in the communication field are provided.

FIG. 3 is a schematic diagram of a device for evaluating speech quality according to an implementation of the present disclosure.

As shown in FIG. 3, the device 30 according to the implementation includes a receiving module 31, an extracting module 32 and an evaluating module 33.

The receiving module 31 is configured to receive speech data to be evaluated.

The extracting module 32 is configured to extract evaluation features of the speech data to be evaluated.

The evaluating module 33 is configured to perform quality evaluation on the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a built speech quality evaluation model, in which the speech quality evaluation model is configured to indicate a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

In some implementations, the speech data to be evaluated includes degraded speech data after passing a communication network.

In some implementations, as shown in FIG. 4, the device 30 according to the implementation further includes a building module 34, which is configured to build the speech quality evaluation model. The building module 34 includes the following.

A first obtaining sub module 341 is configured to obtain speech data, in which the speech data includes clean speech data and degraded speech data.

A second obtaining sub module 342 is configured to obtain clean speech data to be processed according to the clean speech data, and to obtain degraded speech data to be processed according to the degraded speech data.

A computing sub module 343 is configured to compute an evaluation score of the degraded speech data to be processed according to the clean speech data to be processed and the degraded speech data to be processed.

An extracting sub module 344 is configured to extract evaluation features of the degraded speech data to be processed.

A training sub module 345 is configured to build the speech quality evaluation model by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed.

In some implementations, the speech quality evaluation model is built by training with Deep Learning.

In some implementations, the training sub module 345 is further configured to: if the speech quality evaluation model is described by a regression model, take the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed as model inputs and model output respectively, train model parameters, and build the speech quality evaluation model; if the speech quality evaluation model is described by a classification model, take the evaluation features of the degraded speech data to be processed as model inputs, quantize the evaluation score of the degraded speech data to be processed to obtain evaluation classifications, take the evaluation classifications as model outputs, train model parameters, and build the speech quality evaluation model.

In some implementations, the second obtaining sub module 342 is configured to obtain clean speech data to be processed according to the clean speech data, by acts of: taking the clean speech data as the clean speech data to be processed directly; or extracting valid speech segments of the clean speech data, and taking the valid speech segments of the clean speech data as the clean speech data to be processed.
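This passage does not say how valid speech segments are detected. As a minimal sketch, assuming a simple frame-energy threshold as the detector, the extraction could look like the following; `extract_valid_segments`, `frame_len`, and `threshold` are hypothetical names and values.

```python
from typing import List, Sequence


def extract_valid_segments(samples: Sequence[float], frame_len: int = 4,
                           threshold: float = 0.01) -> List[List[float]]:
    """Group consecutive frames whose mean energy reaches the threshold."""
    segments: List[List[float]] = []
    current: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            current.extend(frame)          # frame belongs to a valid segment
        elif current:
            segments.append(current)       # a silent frame closes the segment
            current = []
    if current:
        segments.append(current)           # segment running to the end
    return segments


samples = [0.0] * 4 + [0.5, -0.5, 0.5, -0.5] + [0.0] * 4 + [0.3] * 4
segments = extract_valid_segments(samples)
```

A production system would more likely use a dedicated voice activity detector; the energy threshold is only the simplest stand-in.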

In some implementations, the second obtaining sub module 342 is configured to obtain degraded speech data to be processed according to the degraded speech data, by acts of: taking the degraded speech data as the degraded speech data to be processed directly; or extracting valid speech segments of the degraded speech data, and taking the valid speech segments of the degraded speech data as the degraded speech data to be processed; or clustering the degraded speech data, and taking degraded speech data corresponding to first cluster centers as the degraded speech data to be processed; or extracting valid speech segments of the degraded speech data, clustering the valid speech segments of the degraded speech data, and taking valid speech segments corresponding to second cluster centers as the degraded speech data to be processed.
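The clustering options can be sketched as follows. This is an illustrative Python example only: it assumes each piece of degraded speech data (or each valid segment) has already been summarized as a fixed-length vector, uses plain k-means with a deterministic initialization as the clustering technique, and interprets "data corresponding to cluster centers" as the sample nearest to each center. The name `select_cluster_representatives` is hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def _dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def select_cluster_representatives(vectors: Sequence[Vector], k: int,
                                   iterations: int = 10) -> List[int]:
    """Return, for each of k clusters, the index of the sample nearest its center."""
    # Assumption: the first k samples serve as initial centers.
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: _dist2(v, centers[j]))
            groups[nearest].append(v)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(group) for col in zip(*group)]
    return [
        min(range(len(vectors)), key=lambda i: _dist2(vectors[i], c))
        for c in centers
    ]


vectors = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.0],
           [5.2, 5.0], [0.1, 0.0], [5.1, 5.0]]
representatives = select_cluster_representatives(vectors, k=2)
```

Keeping only the representatives reduces the amount of training data while still covering the distinct kinds of degradation present in the set.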

It should be understood that the device according to the implementation corresponds to the method implementations above. For details, refer to the description of the method implementations, which is not repeated herein.

In the implementation, because the quality evaluation of the speech data to be evaluated is performed via the speech quality evaluation model, only single-ended speech data is needed for speech quality evaluation. This avoids the limitation in applications caused by relying on two-ended speech data, thus expanding the scope of applications.

Implementations of the present disclosure also provide a device. The device includes: one or more processors; and a memory configured to store one or more programs. When the one or more programs are executed by the one or more processors, the one or more processors are caused to perform the method including: receiving speech data to be evaluated; extracting evaluation features of the speech data to be evaluated; and performing quality evaluation to the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a built speech quality evaluation model, in which the speech quality evaluation model is configured to indicate a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

Implementations of the present disclosure also provide a non-transitory computer-readable storage medium having stored therein one or more programs that, when executed by one or more processors, cause the one or more processors to perform the method including: receiving speech data to be evaluated; extracting evaluation features of the speech data to be evaluated; and performing quality evaluation to the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a built speech quality evaluation model, in which the speech quality evaluation model is configured to indicate a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

Implementations of the present disclosure also provide a computer program product. When the computer program product is executed by one or more processors of a device, the one or more processors are caused to perform the method including: receiving speech data to be evaluated; extracting evaluation features of the speech data to be evaluated; and performing quality evaluation to the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a built speech quality evaluation model, in which the speech quality evaluation model is configured to indicate a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.

It should be understood that the same or similar parts in the above implementations may refer to each other, and the contents not described in detail in some implementations may refer to the same or similar contents in other implementations.

It is to be noted that, in the description of the present disclosure, the terms “first” and “second” are only used for description and cannot be seen as indicating or implying relative importance. Furthermore, in the description of the present disclosure, unless otherwise explained, it is to be understood that the term “a plurality of” refers to two or more.

Any procedure or method described in the flow charts or described in any other way herein may be understood to comprise one or more modules, portions or parts for storing executable codes that realize particular logic functions or procedures. Moreover, advantageous implementations of the present disclosure comprise other implementations in which the order of execution is different from that which is depicted or discussed, including executing functions in a substantially simultaneous manner or in an opposite order according to the related functions. This should be understood by those skilled in the art to which implementations of the present disclosure belong.

It should be understood that each part of the present disclosure may be realized by hardware, software, firmware, or a combination thereof. In the above implementations, a plurality of steps or methods may be realized by the software or firmware stored in the memory and executed by the appropriate instruction execution system. For example, if it is realized by the hardware, likewise in another implementation, the steps or methods may be realized by one or a combination of the following techniques known in the art: a discrete logic circuit having a logic gate circuit for realizing a logic function of a data signal, an application-specific integrated circuit having an appropriate combination logic gate circuit, a programmable gate array (PGA), a field programmable gate array (FPGA), etc.

Those skilled in the art shall understand that all or parts of the steps in the above exemplifying method of the present disclosure may be achieved by commanding the related hardware with programs. The programs may be stored in a computer readable storage medium, and the programs comprise one or a combination of the steps in the method implementations of the present disclosure when run on a computer.

In addition, each function cell of the implementations of the present disclosure may be integrated in a processing module, or these cells may exist as separate physical units, or two or more cells may be integrated in a processing module. The integrated module may be realized in a form of hardware or in a form of software function modules. When the integrated module is realized in a form of software function module and is sold or used as a standalone product, the integrated module may be stored in a computer readable storage medium.

The above-mentioned storage medium may be a read-only memory, a magnetic disc, or an optical disc, etc.

Reference throughout this specification to “an implementation,” “some implementations,” “an example,” “a specific example,” or “some examples,” means that a particular feature, structure, material, or characteristic described in connection with the implementation or example is included in at least one implementation or example of the present disclosure. The appearances of the phrases throughout this specification are not necessarily referring to the same implementation or example of the present disclosure. Furthermore, the particular features, structures, materials, or characteristics may be combined in any suitable manner in one or more implementations or examples.

Although explanatory implementations have been shown and described, it would be appreciated by those skilled in the art that the above implementations cannot be construed to limit the present disclosure, and changes, alternatives, and modifications can be made in the implementations without departing from spirit, principles and scope of the present disclosure.

What is claimed is:
1. A method for evaluating speech quality, comprising: receiving speech data to be evaluated; extracting evaluation features of the speech data to be evaluated; and performing, by a processor, quality evaluation to the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, wherein the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.
2. The method of claim 1, wherein the speech data to be evaluated comprises first degraded speech data after passing a communication network.
3. The method of claim 2, further comprising: determining the speech quality evaluation model, wherein determining the speech quality evaluation model comprises: obtaining speech data, wherein the speech data comprises clean speech data and second degraded speech data; obtaining clean speech data to be processed according to the clean speech data, and obtaining degraded speech data to be processed according to the second degraded speech data; computing an evaluation score of the degraded speech data to be processed based on the clean speech data to be processed and the degraded speech data to be processed; extracting evaluation features of the degraded speech data to be processed; and determining the speech quality evaluation model by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed.
4. The method of claim 1, wherein the speech quality evaluation model is determined by training using a Deep Learning technique.
5. The method of claim 4, wherein determining the speech quality evaluation model by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed comprises one of: based on a determination that the speech quality evaluation model comprises a regression model: using the evaluation features of the degraded speech data to be processed as inputs of the speech quality evaluation model and the evaluation score of the degraded speech data to be processed as an output of the speech quality evaluation model, training parameters of the speech quality evaluation model, and determining the speech quality evaluation model; and based on a determination that the speech quality evaluation model comprises a classification model: using the evaluation features of the degraded speech data to be processed as inputs of the speech quality evaluation model, quantizing the evaluation score of the degraded speech data to be processed to obtain evaluation classifications, using the evaluation classifications as outputs of the speech quality evaluation model, training the parameters of the speech quality evaluation model, and determining the speech quality evaluation model.
6. The method of claim 3, wherein obtaining the clean speech data to be processed according to the clean speech data comprises one of: using the clean speech data as the clean speech data to be processed; and extracting valid speech segments from the clean speech data, and using the valid speech segments of the clean speech data as the clean speech data to be processed.
7. The method of claim 3, wherein obtaining the degraded speech data to be processed according to the second degraded speech data comprises one of: using the second degraded speech data as the degraded speech data to be processed; extracting valid speech segments from the second degraded speech data, and using the valid speech segments of the second degraded speech data as the degraded speech data to be processed; clustering the second degraded speech data, and using degraded speech data corresponding to first cluster centers as the degraded speech data to be processed; and extracting valid speech segments from the second degraded speech data, clustering the valid speech segments of the second degraded speech data, and using valid speech segments corresponding to second cluster centers as the degraded speech data to be processed.
8. A non-transitory computer-readable storage medium, comprising instructions for evaluating speech quality, which instructions when executed by a processor become operational with the processor to: receive speech data to be evaluated; extract evaluation features of the speech data to be evaluated; and perform quality evaluation to the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, wherein the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.
9. The non-transitory computer-readable storage medium of claim 8, wherein the speech data to be evaluated comprises first degraded speech data after passing a communication network.
10. The non-transitory computer-readable storage medium of claim 9, further comprising instructions which when executed by the processor become operational with the processor to: obtain speech data, wherein the speech data comprises clean speech data and second degraded speech data; obtain clean speech data to be processed according to the clean speech data, and obtain degraded speech data to be processed according to the second degraded speech data; compute an evaluation score of the degraded speech data to be processed based on the clean speech data to be processed and the degraded speech data to be processed; extract evaluation features of the degraded speech data to be processed; and determine the speech quality evaluation model by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed.
11. The non-transitory computer-readable storage medium of claim 8, wherein the speech quality evaluation model is determined by training using a Deep Learning technique.
12. The non-transitory computer-readable storage medium of claim 11, wherein the instructions operational with the processor to determine the speech quality evaluation model by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed further comprise instructions which when executed by the processor become operational with the processor to: based on a determination that the speech quality evaluation model comprises a regression model: use the evaluation features of the degraded speech data to be processed as inputs of the speech quality evaluation model and the evaluation score of the degraded speech data to be processed as an output of the speech quality evaluation model, train parameters of the speech quality evaluation model, and determine the speech quality evaluation model; or based on a determination that the speech quality evaluation model comprises a classification model: use the evaluation features of the degraded speech data to be processed as inputs of the speech quality evaluation model, quantize the evaluation score of the degraded speech data to be processed to obtain evaluation classifications, use the evaluation classifications as outputs of the speech quality evaluation model, train parameters of the speech quality evaluation model, and determine the speech quality evaluation model.
13. The non-transitory computer-readable storage medium of claim 10, wherein the instructions operational with the processor to obtain the clean speech data to be processed according to the clean speech data further comprise instructions which when executed by the processor become operational with the processor to: use the clean speech data as the clean speech data to be processed; or extract valid speech segments from the clean speech data, and use the valid speech segments of the clean speech data as the clean speech data to be processed.
14. The non-transitory computer-readable storage medium of claim 10, wherein the instructions operational with the processor to obtain the degraded speech data to be processed according to the second degraded speech data further comprise one of: instructions which when executed by the processor become operational with the processor to use the second degraded speech data as the degraded speech data to be processed; instructions which when executed by the processor become operational with the processor to extract valid speech segments from the second degraded speech data, and use the valid speech segments of the second degraded speech data as the degraded speech data to be processed; instructions which when executed by the processor become operational with the processor to cluster the second degraded speech data, and use degraded speech data corresponding to first cluster centers as the degraded speech data to be processed; and instructions which when executed by the processor become operational with the processor to extract valid speech segments from the second degraded speech data, cluster the valid speech segments of the second degraded speech data, and use valid speech segments corresponding to second cluster centers as the degraded speech data to be processed.
15. A device, comprising: one or more processors; and a memory storing one or more programs which when executed by the one or more processors become operational with the one or more processors to: receive speech data to be evaluated; extract evaluation features of the speech data to be evaluated; and perform quality evaluation to the speech data to be evaluated according to the evaluation features of the speech data to be evaluated and a predetermined speech quality evaluation model, wherein the speech quality evaluation model is an indication of a relationship between evaluation features of single-ended speech data and quality information of the single-ended speech data.
16. The device of claim 15, wherein the speech data to be evaluated comprises first degraded speech data after passing a communication network.
17. The device of claim 16, wherein the memory further comprises one or more programs which when executed by the one or more processors become operational with the one or more processors to: determine the speech quality evaluation model, wherein the memory storing the one or more programs which when executed by the one or more processors become operational with the one or more processors to determine the speech quality evaluation model further comprises one or more programs which when executed by the one or more processors become operational with the one or more processors to: obtain speech data, wherein the speech data comprises clean speech data and second degraded speech data; obtain clean speech data to be processed according to the clean speech data, and obtain degraded speech data to be processed according to the second degraded speech data; compute an evaluation score of the degraded speech data to be processed based on the clean speech data to be processed and the degraded speech data to be processed; extract evaluation features of the degraded speech data to be processed; and determine the speech quality evaluation model by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed.
18. The device of claim 15, wherein the speech quality evaluation model is determined by training using a Deep Learning technique, wherein the memory storing the one or more programs operational with the one or more processors to determine the speech quality evaluation model by training according to the evaluation features of the degraded speech data to be processed and the evaluation score of the degraded speech data to be processed further comprises one or more programs which when executed by the one or more processors become operational with the one or more processors to: based on a determination that the speech quality evaluation model comprises a regression model: use the evaluation features of the degraded speech data to be processed as inputs of the speech quality evaluation model and the evaluation score of the degraded speech data to be processed as an output of the speech quality evaluation model, train parameters of the speech quality evaluation model, and determine the speech quality evaluation model; or based on a determination that the speech quality evaluation model comprises a classification model: use the evaluation features of the degraded speech data to be processed as inputs of the speech quality evaluation model, quantize the evaluation score of the degraded speech data to be processed to obtain evaluation classifications, use the evaluation classifications as outputs of the speech quality evaluation model, train parameters of the speech quality evaluation model, and determine the speech quality evaluation model.
19. The device of claim 17, wherein the memory storing the one or more programs operational with the one or more processors to obtain the clean speech data to be processed according to the clean speech data further comprises one or more programs which when executed by the one or more processors become operational with the one or more processors to: use the clean speech data as the clean speech data to be processed; or extract valid speech segments from the clean speech data, and use the valid speech segments of the clean speech data as the clean speech data to be processed; or wherein the memory storing the one or more programs operational with the one or more processors to obtain the degraded speech data to be processed according to the second degraded speech data further comprises one or more programs which when executed by the one or more processors become operational with the one or more processors to: use the second degraded speech data as the degraded speech data to be processed; extract valid speech segments from the second degraded speech data, and use the valid speech segments of the second degraded speech data as the degraded speech data to be processed; cluster the second degraded speech data, and use degraded speech data corresponding to first cluster centers as the degraded speech data to be processed; or extract valid speech segments from the second degraded speech data, cluster the valid speech segments of the second degraded speech data, and use valid speech segments corresponding to second cluster centers as the degraded speech data to be processed.
20. The device of claim 18, wherein the memory storing the one or more programs operational with the one or more processors to quantize the evaluation score of the degraded speech data to be processed to obtain evaluation classifications further comprises one or more programs which when executed by the one or more processors become operational with the one or more processors to: quantize, using a fixed step value, the evaluation score of the degraded speech data to be processed to obtain evaluation classifications; or quantize, using unfixed step values, the evaluation score of the degraded speech data to be processed to obtain evaluation classifications, wherein the unfixed step values are determined for scopes of the evaluation scores based on application demands.